Audience: Diverse Background
Time: 1 day workshop (6 hours)
Pre-Requisites: Prior experience with the Python programming language is essential: this is not an introduction to Python, and basic competency is assumed. If you have not used Python before, consider taking Intro to Python (Data Science Campus) or a DataCamp course prior to attending.
Brief Description: Natural Language Processing (NLP) is a sub-field of Artificial Intelligence concerned with processing and analysing large amounts of natural language data. Some applications include search engines (Google), text classification (spam filters), identifying sentiments for a product (sentiment analysis), discovering abstract topics in a collection of documents (topic modelling) and machine translation. This is an introduction to Natural Language Processing, so the main concepts are cleaning and exploring datasets, and applying feature engineering techniques to transform text data into numerical data.
Aims, Objectives and Intended Learning Outcomes: This module will provide an introduction to the Natural Language Processing field using the Python programming language. It covers basic terminology, the process of ‘cleaning’ a dataset, exploring it, and applying simple feature engineering techniques to transform the data. By the end of the module learners will understand and apply the necessary steps to ‘clean’, explore and transform their dataset in the appropriate order.
Dataset: Patent Dataset, Hep Dataset (High_Energy_Physics), Spam/Ham
Libraries: Before attending the course please make sure that you read the course instructions that you received.
Acknowledgements: Many thanks to Savvas Stephanides for joining me on a pair programming approach to create the function that performs the text preprocessing and for his code review. Many thanks to Joshi Chaitanya that has provided Hep Dataset and some of his code for this course, to Ian Grimstead and Thanasis Anthopoulos for providing the Patent Dataset, to Gareth Clews, Isabela Breton and Dan Lewis for reviewing the material and the code and Dave Pugh for lending the Regex material. Also thanks to everyone who attended the pilot course to provide feedback about the course.
Intended Learning Outcomes: By the end of Chapter 1 you will be able to:
Describe what is special about human language
List the major levels of linguistic structure
Describe how language processing can be challenging
Identify areas in language processing where progress has been made and where it has not
Describe the work procedure for this course
Phonetics
Production of speech sounds by humans
Phonology
Patterns of sounds in a language and across languages
Why do related forms differ? Sane/Sanity, Electric/Electricity, Atom/Atomic. Phonology finds the systematic ways in which the forms differ and explains them.
Syntax
Structure of language
Semantics
Meaning conveyed in language
“How much Chinese silk was exported to Western Europe by the end of the 18th century?”
To answer this question, we need to know something about lexical semantics (the meaning of individual words such as export or silk) as well as compositional semantics (what exactly constitutes Western Europe as opposed to Eastern or Southern Europe, and what end means when combined with the 18th century). We also need to know something about the relationship of the words to the syntactic structure. For example, we need to know that by the end of the 18th century is a temporal end-point.
Morphology
The way words break down into component parts that carry meanings like singular versus plural
Pragmatics
“Use of language in social contexts” (Nordquist, 2017)
From a pragmatic point of view, transmission of meaning is a multifaceted phenomenon that “not only depends on structural and linguistic knowledge […],but also on the context of [each] utterance.” (Wikipedia contributors, 2017)
“I don’t have any money.” What does this mean?
Ambiguity
Most tasks in speech and language processing can be viewed as resolving ambiguity at one of these levels.
“I made her duck”
I cooked waterfowl for her
I cooked waterfowl belonging to her
I created the (plaster?) duck she owns
I caused her to quickly lower her head or body
I waved my magic wand and turned her into undifferentiated waterfowl
(Jurafsky and Martin, 2019)
Coreference resolution
“How many states were in the United States that year?”
What year is that year?
This task of coreference resolution makes use of knowledge about how words like that or pronouns like it or she refer to previous parts of the discourse.
Other Challenges
(Jurafsky and Martin, 2019)
Machine Translation Technologies
Challenge: preserve the meaning of the sentence from one language to the other
Search Engines eg. Google
Challenge: recognize natural language questions, extract the meaning of the question and give an answer
Text Classification eg. Spam Filters
Challenge: overcome false negatives and false positives, i.e. sending non-spam emails to the spam folder and vice versa
Sentiment Analysis eg. identify sentiments for a product
Challenge: understanding sarcasm and ironic comments
Topic Modelling: method for discovering the abstract topics in a document collection
Challenge: using a robust algorithm; how much speed should be sacrificed for accuracy?
Transcription of speech (turning spoken language into written language)
Challenge: dealing with looser grammar
Question Answering: build systems that automatically answer questions posed by humans in a natural language.
Challenge: understanding the infinitely varied forms of expression
Progress Made
(Jurafsky and Martin, 2019)
The task is difficult! What tools do we need?
(Jurafsky and Martin, 2019)
Steps
Have a dataset
Text preprocessing (Data Cleaning)
Exploratory Analysis and Data Transformation
Split the Dataset (Data Scientists may prefer to do the exploratory analysis after they split the Dataset)
Identify the technique that is most suitable for your dataset and what you think you can get out of it. Use this on the train dataset, eg Topic Modelling
Explore different features of the model on the Validate Dataset (Tuning)
Test the accuracy and the robustness of your model
Communicate your results
Make a prediction, if it is possible
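The split in step 4 can be sketched in plain Python (an illustrative 80/10/10 split on a toy list of documents; the proportions and the data here are assumptions for illustration, not part of the course material):

```python
import random

documents = [f"doc {i}" for i in range(100)]  # stand-in for a real corpus
random.seed(42)            # fix the shuffle so the split is reproducible
random.shuffle(documents)  # randomise order before splitting

train = documents[:80]       # 80% for fitting the model
validate = documents[80:90]  # 10% for tuning
test = documents[90:]        # 10% for the final accuracy/robustness check
print(len(train), len(validate), len(test))  # 80 10 10
```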
Note: This is an Introduction to Natural Language Processing, and thus anything after step 3 is beyond this course.
nltk and spaCy are two Python packages that some data scientists have strong feelings about, in favour of one or the other. In this course we will only deal with nltk. nltk is considered good for teaching and understanding, but it is slow; spaCy, on the other hand, is considered fast and more robust.
The first edition of the book, published by O’Reilly, is available at http://nltk.org/book_1ed/ .
The official website is https://spacy.io and the source code on github is available at https://github.com/explosion/spaCy.
“I can say something in a natural language that no one has ever said in the history of the universe.” True or false? Give a reason for your answer.
Draw a syntax tree for:
“The chef cooks the soup”
“Max eat a green apple” Is this an example of compositional semantics? Give a reason for your answer.
“I feel sick today, I don’t want to go to work, what do you think Siri?” What type of NLP application is this? Why would it be difficult to answer?
“I went to the bank.” How would such a sentence be difficult for a language processing application? What measures could be taken to overcome the issue?
Intended Learning Outcomes: By the end of Chapter 2, learners will be able to:
explain the concept of text-preprocessing,
perform the following steps to a dataset:
lowercase,
tokenize,
lemmatization,
removing stop words and punctuation and
performing Part-of-Speech Tagging.
differentiate between lemmatization and stemming.
The data comes in raw form. It may include unnecessary information and/or may not have the form that we need to start processing it.
Text preprocessing removes unnecessary information and changes the data into a form that the machine can process and provide meaningful results.
It is performed before the dataset is split into categories and before the modelling techniques are applied.
Preprocessing ‘cleans’ the data so that the machine (methods) will be able to read and process it. Otherwise, it would not be possible to do that and provide a meaningful outcome.
We have the sentence: ‘The language we use influences the way we think. This is the principle that underlies “Whorfianism”. From 1980 onwards, this view has been subject of increased scrutiny and skepticism!’
To do this we first need to store the sentence as a string in Python:
my_sentence = 'The language we use influences the way we think. This is the principle that underlies "Whorfianism". From 1980 onwards, this view has been subject of increased scrutiny and skepticism!'
print(my_sentence)
## The language we use influences the way we think. This is the principle that underlies "Whorfianism". From 1980 onwards, this view has been subject of increased scrutiny and skepticism!
Exercise: Create a string my_opinion = ‘Natural Language Processing is a key component of Artificial Intelligence.’ or express your own short opinion.
Quite often the same word in a text can be written with capital or lowercase letters, eg “Natural” or “natural”. In NLP, they could be recognised as two different words. Thus, converting everything to lowercase will ensure that this does not happen.
lowercase_sentence = my_sentence.lower()
print(lowercase_sentence)
## the language we use influences the way we think. this is the principle that underlies "whorfianism". from 1980 onwards, this view has been subject of increased scrutiny and skepticism!
Exercise: Convert my_opinion to lowercase.
Tokenization is the process of splitting a string of word(s) into pieces (or tokens), eg the tokens of the phrase ‘My house’ are: ‘My’ and ‘house’.
Tokenization makes it easier to process every word eg find its frequency.
import nltk

tokens_from_sentence = nltk.word_tokenize(lowercase_sentence)
print(tokens_from_sentence)
## ['the', 'language', 'we', 'use', 'influences', 'the', 'way', 'we', 'think', '.', 'this', 'is', 'the', 'principle', 'that', 'underlies', '``', 'whorfianism', "''", '.', 'from', '1980', 'onwards', ',', 'this', 'view', 'has', 'been', 'subject', 'of', 'increased', 'scrutiny', 'and', 'skepticism', '!']
Exercise: Tokenize my_opinion.
Note: Sentence Segmentation
example_sentences = """Natural languages can’t be directly translated into a precise set of mathematical operations, but they do contain information and instructions that can be extracted. Those pieces of information and instruction can be stored, indexed, searched, or immediately acted upon. One of those actions could be to generate a sequence of words in response to a statement"""
sentence_segments = nltk.sent_tokenize(example_sentences)# splits the text into sentences - does it simply break after every full stop, or is it cleverer than that?
print(sentence_segments)
## ['Natural languages can’t be directly translated into a precise set of mathematical operations, but they do contain information and instructions that can be extracted.', 'Those pieces of information and instruction can be stored, indexed, searched, or immediately acted upon.', 'One of those actions could be to generate a sequence of words in response to a statement']
The POS Tagger reads text and assigns a part-of-speech tag to each word, eg adjective, verb, noun.
tokens_with_part_of_speech_tag = nltk.pos_tag(tokens_from_sentence)
print(tokens_with_part_of_speech_tag)
## [('the', 'DT'), ('language', 'NN'), ('we', 'PRP'), ('use', 'VBP'), ('influences', 'NNS'), ('the', 'DT'), ('way', 'NN'), ('we', 'PRP'), ('think', 'VBP'), ('.', '.'), ('this', 'DT'), ('is', 'VBZ'), ('the', 'DT'), ('principle', 'NN'), ('that', 'IN'), ('underlies', 'VBZ'), ('``', '``'), ('whorfianism', 'NN'), ("''", "''"), ('.', '.'), ('from', 'IN'), ('1980', 'CD'), ('onwards', 'NNS'), (',', ','), ('this', 'DT'), ('view', 'NN'), ('has', 'VBZ'), ('been', 'VBN'), ('subject', 'JJ'), ('of', 'IN'), ('increased', 'JJ'), ('scrutiny', 'NN'), ('and', 'CC'), ('skepticism', 'NN'), ('!', '.')]
JJ: adjective
NN: noun
NNP: proper noun (a name)
IN: preposition
VBZ: verb, 3rd person sing. present (walks)
VBP: verb, non-3rd person singular present
DT: determiner
JJS: adjective superlative (tallest)
RB: adverb (quietly)
CD: cardinal digit
CC: Coordinating Conjunction
PRP: Personal Pronoun
How to keep only nouns, adjectives, verbs and adverbs:
new_sentence = [each_token[0] for each_token in tokens_with_part_of_speech_tag if each_token[1] in ["JJ", "NN", "VB","RB"]]
# JJ (adjective), NN (noun), NNP (proper noun), RB (adverb), VB (verb)
print(new_sentence)
## ['language', 'way', 'principle', 'whorfianism', 'view', 'subject', 'increased', 'scrutiny', 'skepticism']
However, nltk does not “think” the same way as humans: each part of speech has several tag variants, so to keep all the nouns, adjectives, verbs and adverbs we need to list every variant.
important_words = [each_token[0] for each_token in tokens_with_part_of_speech_tag if each_token[1] in ["JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS", "RB", "RBR", "RBS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]]
#JJ adjective 'big', JJR adjective, comparative 'bigger', JJS adjective, superlative 'biggest'
#NN noun, singular 'desk', NNS noun plural 'desks', NNP proper noun, singular 'Harrison', NNPS proper noun, plural 'Americans'
#RB adverb very, silently, RBR adverb, comparative better, RBS adverb, superlative best
#VB verb, base form take, VBD verb, past tense took, VBG verb, gerund/present participle taking, VBN verb, past participle taken, VBP verb, sing. present, non-3d take, VBZ verb, 3rd person sing. present takes
print(important_words)
## ['language', 'use', 'influences', 'way', 'think', 'is', 'principle', 'underlies', 'whorfianism', 'onwards', 'view', 'has', 'been', 'subject', 'increased', 'scrutiny', 'skepticism']
Note:
Notice the difference from keeping only the simple adjective, verb, noun and adverb tags? You don’t have to use all these categories in this course, but be aware of them. In the Appendix there is a list of all the nltk POS-tagging categories.
Exercises
Do POS tagging on the tokens of my_opinion.
Create a new sentence from my_opinion by keeping only the nouns and verbs.
Simple Explanation: Stemming is the process of reducing a word to its stem (removing prefixes and suffixes), even if the stem itself is not necessarily a valid root.
Formal Explanation: Stemming is the process of reducing inflected (or sometimes derived) words to their word stem; that is, their base or root form. For example, the words argue, argued, argues and arguing reduce to the stem argu. Usually stemming is a crude heuristic process that chops off the ends of words in the hope of achieving the root correctly most of the time.
Stemming aims to remove the excess part of the word to be able to identify words that are similar.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer() #Define the stemmer: the Porter stemming algorithm (Porter, 1980)
stemmed_sentence = map(stemmer.stem, tokens_from_sentence) #apply the stemming algorithm to tokens_from_sentence
#map() applies the function func to all the elements of the sequence seq. The first argument func is the name of a function and the second a sequence (e.g. a list) seq.
print(list(stemmed_sentence))
## ['the', 'languag', 'we', 'use', 'influenc', 'the', 'way', 'we', 'think', '.', 'thi', 'is', 'the', 'principl', 'that', 'underli', '``', 'whorfian', "''", '.', 'from', '1980', 'onward', ',', 'thi', 'view', 'ha', 'been', 'subject', 'of', 'increas', 'scrutini', 'and', 'skeptic', '!']
Question: What do you think of the stemming output? When could it prove useful?
Exercise: Apply stemming to the tokens of my_opinion.
Note: The stem of the word “beginners” is “beginn”, but the stem of the word “begins” is “begin”.
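The note above can be checked directly with nltk’s PorterStemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("beginners"))  # beginn
print(stemmer.stem("begins"))     # begin
```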
Simple Explanation: Lemmatization is the process of converting a word to its dictionary form, eg women becomes woman, walking becomes walk.
Formal Explanation: Lemmatisation uses vocabulary and morphological analysis of words to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. Most lemmatisers achieve this using a lookup table, so with large volumes of text this process may be slower than stemming. However, if it is suitable for your data, then lemmatising is generally the recommended approach.
If confronted with the token ‘saw’, stemming might return just ‘s’, whereas lemmatisation would attempt to return either ‘see’ or ‘saw’ depending on whether the use of the token was as a verb or a noun.
We could start to build our own stemming function using rules such as:
if the word ends in ‘ed’, remove the ‘ed’
if the word ends in ‘ing’, remove the ‘ing’
if the word ends in ‘ly’, remove the ‘ly’.
This might work for stemming, but lemmatising is a far more complex challenge, as you have to generate a whole database of the English language which understands word morphology.
But there is good news - someone has already done all the hard work for us!
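The three rules above can be sketched as a toy function (illustrative only; a real stemmer such as Porter’s handles many more cases):

```python
def naive_stem(word):
    # Strip one of the three suffixes mentioned above, if present.
    for suffix in ("ed", "ing", "ly"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(naive_stem("walked"))   # walk
print(naive_stem("quickly"))  # quick
print(naive_stem("singing"))  # sing
```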
Lemmatizing aims to remove the excess part of the word to be able to identify words that are similar.
Lemmatization and Stemming: Stemming operates on each word without considering the context and it cannot discriminate between different word meaning. Lemmatization, however, takes into account the part of speech and the context.
Example:
“better” has “good” as its lemma and “better” as its stem.
“walking” has “walk” as both its lemma and its stem.
“meeting” can be either a noun or a verb depending on the context, eg “in our last meeting” or “we are meeting again tomorrow”. Lemmatization can select the appropriate lemma based on the context, unlike stemming.
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

wordnet_lemmatizer = WordNetLemmatizer()#WordNet is a lexical database
#parts_of_speech = [wordnet.ADJ, wordnet.ADV, wordnet.NOUN, wordnet.VERB]
noun_lemma = wordnet_lemmatizer.lemmatize(tokens_from_sentence[4], pos=wordnet.NOUN)
print(noun_lemma)
## influence
Questions:
Why does this differ from Stemming?
Can you think of any more words that will change with lemmatization?
Exercise: Apply lemmatization to the tokens of my_opinion.
Note: Lemmatisation uses a lookup table to return words to their roots; stemming purely cuts text off the string, which is far less robust than lemmatisation. However, stemming is useful if you have lots of typos and out-of-dictionary words. Data scientists have different approaches when it comes to stemming and lemmatizing; the rule of thumb is to do one or the other, not both at the same time.
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
print(stop_words)
## {'ve', 'him', 'we', 'more', 'just', 'once', 'himself', 'i', 'then', 'same', "isn't", 'during', 'other', 'while', 'be', 'out', 'their', 'against', 'had', 'than', 'were', 're', 'to', 'don', 'why', 'some', 'and', 'such', 'mustn', 'with', "wouldn't", 'under', 'no', "don't", "mustn't", 'by', "didn't", 'hadn', 'after', 'only', 'hers', 'theirs', 'for', 'both', 'if', 'shouldn', 'the', 'does', 'in', "haven't", "she's", 'haven', 'weren', 'of', 'o', 'yourselves', "couldn't", 'wouldn', 'your', 'our', 'from', 'them', 'nor', "hadn't", 'those', 'down', "you're", 'who', 'was', "should've", 'these', "aren't", 'been', 'has', 'own', 'shan', 'can', 'it', 'aren', 'you', 'which', 'whom', "mightn't", "hasn't", 'ourselves', 'needn', 'before', 'y', 'have', 'won', "won't", 'wasn', "shouldn't", 'now', 'm', "needn't", 'up', 'themselves', 'her', 'itself', 'didn', 'mightn', 'll', "shan't", 'but', 'is', 'isn', 's', 'should', 'hasn', 'doing', 'how', 'me', "weren't", 'very', 'because', 'so', 'on', 'am', 'when', 'further', 'being', 'too', 't', 'most', 'yours', 'above', 'at', 'his', 'she', 'again', 'that', 'until', 'd', 'any', 'as', 'he', 'are', 'couldn', 'doesn', 'a', "doesn't", 'what', "wasn't", 'did', 'ain', "it's", 'this', 'each', 'here', 'where', 'few', 'my', 'myself', 'not', 'over', 'through', 'they', "you've", 'herself', 'there', 'will', 'below', "you'll", 'do', 'between', 'ours', 'having', 'or', 'all', "that'll", 'an', 'off', 'its', 'ma', 'yourself', 'about', 'into', "you'd"}
print(tokens_from_sentence)
## ['the', 'language', 'we', 'use', 'influences', 'the', 'way', 'we', 'think', '.', 'this', 'is', 'the', 'principle', 'that', 'underlies', '``', 'whorfianism', "''", '.', 'from', '1980', 'onwards', ',', 'this', 'view', 'has', 'been', 'subject', 'of', 'increased', 'scrutiny', 'and', 'skepticism', '!']
tokens_without_stopwords = [each_token for each_token in tokens_from_sentence if each_token not in stop_words]
print(tokens_without_stopwords)
## ['language', 'use', 'influences', 'way', 'think', '.', 'principle', 'underlies', '``', 'whorfianism', "''", '.', '1980', 'onwards', ',', 'view', 'subject', 'increased', 'scrutiny', 'skepticism', '!']
#string.punctuation is a string containing all ASCII punctuation characters
#str.maketrans creates a translation table
#every punctuation character in string.punctuation is mapped to None in the translation table, so when str.translate() encounters one of these characters it removes it. This is what happens below.
import string

punctuation_table = str.maketrans({key: None for key in string.punctuation})
tokens_without_punctuation = [token.translate(punctuation_table) for token in tokens_from_sentence]
print(tokens_without_punctuation)
## ['the', 'language', 'we', 'use', 'influences', 'the', 'way', 'we', 'think', '', 'this', 'is', 'the', 'principle', 'that', 'underlies', '', 'whorfianism', '', '', 'from', '1980', 'onwards', '', 'this', 'view', 'has', 'been', 'subject', 'of', 'increased', 'scrutiny', 'and', 'skepticism', '']
Note: In the above example we have removed punctuation from tokens. A very easy way to remove punctuation from a list is the following:
for each_punctuation_mark in string.punctuation:
my_sentence = my_sentence.replace(each_punctuation_mark,"")
print(my_sentence)
## The language we use influences the way we think This is the principle that underlies Whorfianism From 1980 onwards this view has been subject of increased scrutiny and skepticism
Exercise: Remove StopWords from the tokens of my_opinion.
print(tokens_from_sentence) #Recall the Tokens in the Sentence
## ['the', 'language', 'we', 'use', 'influences', 'the', 'way', 'we', 'think', '.', 'this', 'is', 'the', 'principle', 'that', 'underlies', '``', 'whorfianism', "''", '.', 'from', '1980', 'onwards', ',', 'this', 'view', 'has', 'been', 'subject', 'of', 'increased', 'scrutiny', 'and', 'skepticism', '!']
alphabetic_tokens = [token for token in tokens_from_sentence if token.isalpha()]#Remove anything that is not alphabetic
print(alphabetic_tokens)
## ['the', 'language', 'we', 'use', 'influences', 'the', 'way', 'we', 'think', 'this', 'is', 'the', 'principle', 'that', 'underlies', 'whorfianism', 'from', 'onwards', 'this', 'view', 'has', 'been', 'subject', 'of', 'increased', 'scrutiny', 'and', 'skepticism']
large_tokens = [token for token in tokens_from_sentence if len(token) > 2]#Remove short words (words of 2 characters or fewer)
print(large_tokens)
## ['the', 'language', 'use', 'influences', 'the', 'way', 'think', 'this', 'the', 'principle', 'that', 'underlies', 'whorfianism', 'from', '1980', 'onwards', 'this', 'view', 'has', 'been', 'subject', 'increased', 'scrutiny', 'and', 'skepticism']
Exercises:
Remove any other words from the tokens of my_opinion.
Have you found a convenient order for the steps? (You can take a look at the next section if you want. It is not cheating!)
Note:
more_stopwords = {'Beginners','Ever'}
extended_stopwords_set = set(stopwords.words('english')) | more_stopwords
print(extended_stopwords_set)
## {'ve', 'him', 'we', 'more', 'just', 'once', 'himself', 'i', 'then', 'same', "isn't", 'during', 'other', 'while', 'be', 'out', 'their', 'against', 'had', 'than', 'were', 're', 'to', 'don', 'why', 'some', 'and', 'such', 'mustn', 'with', "wouldn't", 'under', 'no', "don't", "mustn't", 'by', "didn't", 'hadn', 'after', 'only', 'hers', 'theirs', 'for', 'both', 'if', 'shouldn', 'the', 'Ever', 'does', 'in', "haven't", "she's", 'haven', 'weren', 'of', 'o', 'yourselves', "couldn't", 'wouldn', 'your', 'our', 'from', 'them', 'nor', "hadn't", 'those', 'down', "you're", 'who', 'was', "should've", 'these', "aren't", 'been', 'has', 'own', 'shan', 'can', 'it', 'aren', 'you', 'which', 'whom', "mightn't", "hasn't", 'ourselves', 'needn', 'before', 'y', 'have', 'won', "won't", 'wasn', "shouldn't", 'now', 'm', "needn't", 'up', 'themselves', 'her', 'itself', 'didn', 'mightn', 'll', "shan't", 'but', 'is', 'isn', 's', 'should', 'hasn', 'doing', 'how', 'me', "weren't", 'very', 'because', 'so', 'on', 'am', 'when', 'further', 'being', 'too', 't', 'most', 'yours', 'above', 'at', 'his', 'she', 'again', 'that', 'until', 'd', 'any', 'as', 'he', 'are', 'couldn', 'doesn', 'a', "doesn't", 'what', "wasn't", 'did', 'ain', "it's", 'this', 'each', 'here', 'where', 'few', 'my', 'myself', 'not', 'over', 'through', 'they', "you've", 'herself', 'there', 'will', 'below', "you'll", 'do', 'between', 'ours', 'having', 'or', 'all', "that'll", 'an', 'off', 'its', 'ma', 'yourself', 'about', 'into', 'Beginners', "you'd"}
stopwords_to_stay_in_dataset = {"won't", "wouldn't"}
updated_stopwords_set = set(stopwords.words('english')) - stopwords_to_stay_in_dataset
print(updated_stopwords_set)
## {'ve', 'him', 'we', 'more', 'just', 'once', 'himself', 'i', 'then', 'same', "isn't", 'during', 'other', 'while', 'be', 'out', 'their', 'against', 'had', 'than', 'were', 're', 'to', 'don', 'why', 'some', 'and', 'such', 'mustn', 'with', 'under', 'no', "don't", "mustn't", 'by', "didn't", 'hadn', 'after', 'only', 'hers', 'theirs', 'for', 'both', 'if', 'shouldn', 'the', 'does', 'in', "haven't", "she's", 'haven', 'weren', 'of', 'o', 'yourselves', "couldn't", 'wouldn', 'your', 'our', 'from', 'them', 'nor', "hadn't", 'those', 'down', "you're", 'who', 'was', "should've", 'these', "aren't", 'been', 'has', 'own', 'shan', 'can', 'it', 'aren', 'you', 'which', 'whom', "mightn't", "hasn't", 'ourselves', 'needn', 'before', 'y', 'have', 'won', 'wasn', "shouldn't", 'now', 'm', "needn't", 'up', 'themselves', 'her', 'itself', 'didn', 'mightn', 'll', "shan't", 'but', 'is', 'isn', 's', 'should', 'hasn', 'doing', 'how', 'me', "weren't", 'very', 'because', 'so', 'on', 'am', 'when', 'further', 'being', 'too', 't', 'most', 'yours', 'above', 'at', 'his', 'she', 'again', 'that', 'until', 'd', 'any', 'as', 'he', 'are', 'couldn', 'doesn', 'a', "doesn't", 'what', "wasn't", 'did', 'ain', "it's", 'this', 'each', 'here', 'where', 'few', 'my', 'myself', 'not', 'over', 'through', 'they', "you've", 'herself', 'there', 'will', 'below', "you'll", 'do', 'between', 'ours', 'having', 'or', 'all', "that'll", 'an', 'off', 'its', 'ma', 'yourself', 'about', 'into', "you'd"}
def clean_up_text(text):
tokens = split_text_to_tokens(text)
tokens = clean_up_tokens(tokens)
processed_text = " ".join(tokens)
return processed_text
def split_text_to_tokens(text):
return nltk.word_tokenize(text)
def clean_up_tokens(tokens):
tokens = remove_punctuation_from_tokens(tokens)
tokens = remove_non_alphabetic_tokens(tokens)
tokens = set_tokens_to_lowercase(tokens)
tokens = remove_stopwords_from_tokens(tokens)
tokens = remove_small_words_from_tokens(tokens)
tokens = lemmatize_tokens(tokens)
tokens = remove_unimportant_words_from_tokens(tokens)
return tokens
def remove_punctuation_from_tokens(tokens):
translation_table = str.maketrans({key: None for key in string.punctuation})
text_without_punctuations = []
for each_token in tokens:
text_without_punctuations.append(each_token.translate(translation_table))
return text_without_punctuations
def remove_non_alphabetic_tokens(tokens):
alphabetic_tokens = []
for token in tokens:
if token.isalpha():
alphabetic_tokens.append(token)
return alphabetic_tokens
def set_tokens_to_lowercase(tokens):
    return [each_token.lower() for each_token in tokens]
def remove_stopwords_from_tokens(tokens):
stop_words = set(stopwords.words("english"))
return [each_token for each_token in tokens if each_token not in stop_words]
def remove_small_words_from_tokens(tokens):
return [each_token for each_token in tokens if len(each_token) > 2]
def remove_unimportant_words_from_tokens(tokens):
lemmatized_tokens = lemmatize_tokens(tokens)
tokens_with_part_of_speech_tags = nltk.pos_tag(lemmatized_tokens)
cleared_token_list = [each_token[0] for each_token in tokens_with_part_of_speech_tags if each_token[1] in ["JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS", "RB", "RBR", "RBS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]]
# JJ (adjective), NN (noun), NNP (proper noun), RB (adverb), VB (verb)
return cleared_token_list
def lemmatize_tokens(tokens):
wordnet_lemmatizer = WordNetLemmatizer()
parts_of_speech = [wordnet.ADJ, wordnet.ADJ_SAT, wordnet.ADV, wordnet.NOUN, wordnet.VERB]
lemmatized_tokens = tokens
for each_part_of_speech in parts_of_speech:
lemmatized_tokens = [wordnet_lemmatizer.lemmatize(each_token, pos=each_part_of_speech) for each_token in lemmatized_tokens]
return lemmatized_tokens
def preprocess(text):
    tokens = split_text_to_tokens(text)
    tokens = remove_non_alphabetic_tokens(tokens)
    tokens = remove_punctuation_from_tokens(tokens)
    tokens = set_tokens_to_lowercase(tokens)
    return tokens
my_opinion = 'The NLP techniques you’ll learn, are powerful enough to create machines that can surpass humans in both accuracy and speed for some surprisingly subtle tasks. For example, you might not have guessed that recognizing sarcasm in an isolated Twitter message can be done more accurately by a machine than by a human. Don’t worry, humans are still better at recognizing humor and sarcasm within an ongoing dialog, due to our ability to maintain information about the context of a statement. But machines are getting better and better at maintaining context.'
clean_opinion = clean_up_text(my_opinion)
print(clean_opinion)
## nlp technique learn powerful enough create machine surpass human accuracy speed surprisingly subtle task example guess recognize sarcasm isolate twitter message do accurately machine human worry human still good recognize humor sarcasm ongoing dialog due ability maintain information context statement machine get good good maintain context
Exercises:
Use the clean_up_text() to do text-preprocessing to the sentences ‘The song of Ariana Grande has been number one hit on the charts for the last 3 months. When will Ed Sheeran become number 1 again?’
Import the Hep Dataset and do the text-preprocessing as we learnt.
Hint: The Hep Dataset is a pickle file. Make sure that your workspace is in the same directory as your dataset. To import a pickle file use the following code:
import pickle
import pandas as pd
high_energy_physics_dataset = pd.read_pickle("./Hep_Dataset.pkl")
print(high_energy_physics_dataset.head(1))
## Text ... Theory-HEP
## 0 [Dark Matter and Gauge Coupling Unification in... ... 0
##
## [1 rows x 8 columns]
Words like Ph.D that contain a “.” without ending the sentence would require an exception in the sentence segmenter. Additionally, words like don’t and won’t also need to be handled with caution.
Using different methods for lemmatization may give different results; staying consistent throughout your work will ease your processing and will not mess with your results.
Usually stemming is not preferred. If you do want to use stemming to help you find more words that are closely related, then it is better to keep both the stemmed and the non-stemmed version of each word. This will help you present the results at the end.
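Keeping both versions can be as simple as mapping each stem back to the set of original words it came from (a sketch, reusing the argue example from earlier):

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["argue", "argued", "argues", "arguing"]

stem_to_originals = defaultdict(set)
for word in words:
    stem_to_originals[stemmer.stem(word)].add(word)  # stem -> original forms

print(sorted(stem_to_originals["argu"]))  # ['argue', 'argued', 'argues', 'arguing']
```

The stems can then be used for matching, while the original forms are what you show when presenting results.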
Intended Learning Outcomes: By the end of Chapter 3, it is expected that you will be able to:
Describe the 4 key techniques in corpus linguistics
Extract raw frequencies, concordance, collocations and keyness from the corpus under study
Calculate lexical diversity
View lexical dispersion for selected tokens
Appreciate the benefits that concordance tools can bring to linguistic analysis
You will also be able to find the most frequent words and plot a frequency distribution.
As the name suggests, you’re exploring – looking for clues!
Tukey (1977) calls it: “detective work”
For example: establishing the data’s underlying structure, identifying mistakes and missing data, establishing the key variables, spotting anomalies, checking assumptions and testing hypotheses in relation to a specific model.
EDA is used in conjuction with Confirmatory Data Analysis where you evaluate your evidence using traditional statistical tools such as significance, inference, and confidence.
Corpora
Corpus linguistics is a field which focuses upon a set of methods for studying language. It is the scientific study of language on the basis of text corpora. It is not a monolithic, consensually agreed set of methods and procedures; it is a heterogeneous field, although there are some basic generalisations that we can make.
Corpus linguistics involves gathering a corpus (homogeneous, of a particular genre). A corpus (plural corpora) is a collection of texts used for linguistic analyses. Such corpora generally comprise hundreds of thousands to billions of words; they are not made up of the linguist’s or a native speaker’s invented examples but are based on authentic, naturally occurring spoken or written usage.
* Word Frequency Analysis
* Concordance
* Collocation
* Keyness
A simple tallying of the number of instances of something that occurs in a corpus
Tally
Zipf noticed that the second most common word ‘of’ occurs about half as often as the most common word ‘the’. While the third most common word ‘to’ occurs about a third as often as ‘the’. And so on.
More generally, the frequency of the nth most common word is about 1/n times the frequency of the most common word.
So a graph of the frequencies of the most common words looks roughly like this:
Tally Graph
Language after language, corpus after corpus, linguistic type after linguistic type, . . . we observe the same “few giants, many dwarves” pattern.
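You can see the 1/n pattern directly on a small frequency list (the counts below are invented for illustration, not taken from a real corpus):

```python
# Zipf's law: the nth most common word occurs roughly 1/n times as
# often as the most common word. Counts below are made up for ranks 1..5.
observed = [1000, 490, 340, 250, 198]
for rank, freq in enumerate(observed, start=1):
    expected = observed[0] / rank  # Zipf's prediction
    print(rank, freq, round(expected, 1))
```

Each observed count sits close to the 1/n prediction, giving the “few giants, many dwarves” curve shown above.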
The most basic statistical measure is a frequency count, as shown above. There are 1,103 examples of the word Lancaster in the written section of the BNC. This may be expressed as a percentage of the whole corpus; the BNC’s written section contains 87,903,571 words of running text, meaning that the word Lancaster represents 0.013% of the total data in the written section of the corpus. The percentage is just another way of looking at the count 1,103 in context, to try to make sense of it relative to the totality of the written corpus.
Sometimes, as is the case here, the percentage may not convey the frequency of use of the word meaningfully, so we might instead produce a normalised frequency (or relative frequency), which answers the question ‘how often might we assume we will see the word per x words of running text?’ Normalised frequencies are usually given per thousand words or per million words.
(McEnery and Hardie, 2012)
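Using the BNC figures quoted above, the normalised frequency of Lancaster can be computed directly:

```python
# Normalised (relative) frequency: occurrences per million words of
# running text, using the BNC figures for "Lancaster" quoted above.
count = 1103
corpus_size = 87_903_571
per_million = count / corpus_size * 1_000_000
print(round(per_million, 2))  # 12.55 occurrences per million words
```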
Import Data (Spam/Ham Dataset) https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
The SMS Spam Collection is a set of SMS messages collected for SMS spam research. It contains 5,574 messages in English, tagged as ham (legitimate) or spam.
Here’s some code to load the dataset and plot the class balance:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
raw_data = pd.read_csv("C:/IR Course/NLP_Intro/SMSSpamCollection.csv", encoding='iso-8859-1')
raw_data["Email"].value_counts().plot(kind = 'pie', explode = [0, 0.1], figsize = (6, 6), autopct = '%1.1f%%', shadow = True)
plt.ylabel("Spam vs Ham")
plt.legend(["Ham", "Spam"])
plt.show()
def getsampledata(pdf, psamp):
    types = ['spam', 'ham']
    allsamples = pd.DataFrame()
    for i in types:
        data1 = pdf[pdf.Email == i]
        rows = np.random.choice(data1.index.values, psamp)  # note: samples with replacement by default
        sampled_data = pdf.loc[rows]
        allsamples = allsamples.append(sampled_data, ignore_index=True)  # pandas >= 2.0: use pd.concat instead
    return allsamples
samp_data = getsampledata(raw_data, 5)
def populatedictcorpus(data):
    pdict1 = {}
    textspam = ""
    textham = ""
    list_WtV_spam = []
    list_WtV_ham = []
    for index, row in data.iterrows():
        if row['Email'] == 'spam':
            textspam = row['Description'] + " " + textspam
            list_WtV_spam.append(row['Description'].split(" "))
        else:
            textham = row['Description'] + " " + textham
            list_WtV_ham.append(row['Description'].split(" "))
    pdict1.update({'spam': textspam})
    pdict1.update({'ham': textham})
    alldata = [pdict1, list_WtV_spam, list_WtV_ham]
    return alldata
import operator
def freqalltokens(palltext):
    dictcounts = {}
    palltext = palltext.split(" ")
    for token in palltext:
        if token in dictcounts:
            dictcounts[token] = dictcounts[token] + 1
        else:
            dictcounts[token] = 1
    sorted_val = sorted(dictcounts.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_val
def plotall(px, py):
    plt.xticks(fontsize=6, rotation=90)
    plt.ylabel('Frequency')
    plt.plot(px, py)
    plt.show()
def lexical_diversity(text):
    # works on any sequence: pass a token list for word-level figures;
    # passing a raw string counts characters instead
    info = []
    info.append(len(text))
    info.append(len(set(text)))
    info.append(len(set(text)) / len(text))
    return info
count_spamham = []
sum_tokens = 0
alltext = ""
complete_list = populatedictcorpus(raw_data)
dict1 = complete_list[0]
for key in dict1:
    count_spamham.append([key, len(dict1[key])])  # length in characters of each class's text
    sum_tokens = len(dict1[key].split(" ")) + sum_tokens
    alltext = alltext + dict1[key]
a = freqalltokens(alltext)
token = []
count = []
for item in a:
    token.append(item[0])
    count.append(item[1])
plotall(token, count)
for itemnum in range(len(count_spamham)):
    print("Number of characters in:", count_spamham[itemnum][0], count_spamham[itemnum][1])
## Number of characters in: spam 104334
## Number of characters in: ham 349719
print("Number of tokens in text:", sum_tokens)
## Number of tokens in text: 87537
# note: dict1 values are raw strings, so these figures are character-level;
# pass .split(" ") to lexical_diversity for word-level diversity
lingstats = lexical_diversity(dict1['spam'] + " " + dict1['ham'])
print("Total characters:", lingstats[0])
## Total characters: 454054
print("Unique characters:", lingstats[1])
## Unique characters: 108
print("Character type/token ratio:", round(lingstats[2], 6))
## Character type/token ratio: 0.000238
#Spam word cloud
from wordcloud import WordCloud
def words_to_cloud(pstr):
    wordcloud = WordCloud().generate(pstr)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
words_to_cloud(dict1['spam'])
Lexical Diversity is “the range of different words used in a text, with a greater range indicating a higher diversity”
Imagine a text which keeps repeating the same few words again and again – for example: ‘manager‘, ‘thinks‘ and ‘finishes‘.
Compare this with a text which avoids that sort of repetition, and instead uses different vocabulary for the same ideas, ‘manager, boss, chief, head, leader‘, ‘thinks, deliberates, ponders, reflects‘.
The second text is likely to be more complex and more difficult to read. It is said to have more lexical diversity than the first text, and this is why Lexical Diversity (LD) is thought to be an important measure of text difficulty.
Type Token Ratio: the number of different words (types)/all words produced (tokens)
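The ratio above takes only a couple of lines to compute. A minimal sketch on two toy texts echoing the example in this section (tokenised naively by whitespace):

```python
# Type/token ratio: unique words (types) divided by all words (tokens).
# The repetitive text scores lower than the varied one.
def type_token_ratio(tokens):
    return len(set(tokens)) / len(tokens)

repetitive = "the manager thinks the manager thinks the manager thinks".split()
varied = "the manager ponders while the chief deliberates and reflects".split()
print(type_token_ratio(repetitive))  # 3 types / 9 tokens
print(type_token_ratio(varied))      # 8 types / 9 tokens
```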
#### 3.6.5 Lexical Dispersion
The location of a word within a text can be determined: for example, how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of the word, and each row represents the entire text.
#dispersion
spam_text_tokens = nltk.word_tokenize(dict1['spam']) #tokenize
spam_text_object = nltk.Text(spam_text_tokens) #turning it into an nltk.Text object to be able to use .concordance, .similar etc
spam_text_object.dispersion_plot(["call", "service", "text"])
The frequency count of types that we did above is useful to a certain extent. In order to see what the frequency is all about we need to look at the types in context, that is, we need to make a concordance of the type in question. Making a concordance will put the word in the middle and show you what the surrounding text looks like.
Also known as keyword in context or KWIC.
allspamtokens = nltk.word_tokenize(dict1['spam']) #tokenize
spamtoken_object = nltk.Text(allspamtokens) #turning it into an nltk.Text object to be able to use .concordance, .similar etc
spamtoken_object.concordance('call')
## Displaying 25 of 346 matches:
## £750 Pound prize . 2 claim is easy , call 087187272008 NOW1 ! Only 10p per min
## ER FROM O2 : To get 2.50 pounds free call credit and details of great offers p
## ows 800 un-redeemed S.I.M . points . Call 08718738001 Identifier Code : 49557
## are awarded a SiPix Digital Camera ! call 09061221061 from landline . Delivery
## ws 800 un-redeemed S. I. M. points . Call 08719899229 Identifier Code : 40411
## a FREE 8Ball wallpaper 2p per min to call Germany 08448350055 from your BT lin
## u have won a £1000 prize GUARANTEED Call 09064017295 Claim code K52 Valid 12h
## on £1000 cash or a Spanish holiday ! CALL NOW 09050000332 to claim . T & C : R
## 711 & first=true¡C C Ringtone¡ Txt : CALL to No : 86888 & claim your reward of
## test colour camera mobile for Free ! Call The Mobile Update Co FREE on 0800298
## shopping breaks from 45 per person ; call 0121 2025050 or visit www.shortbreak
## be even £1000 cash to claim ur award call free on 0800 ... .. ( 18+ ) . Its a
## ling ! Would your little ones like a call from Santa Xmas eve ? Call 090580945
## es like a call from Santa Xmas eve ? Call 09058094583 to book your time . You
## your time . You have 1 new message . Call 0207-083-6089 Free entry to the gr8p
## omer claims dept . Expires 13/4/04 . Call 08717507382 NOW ! A £400 XMAS REWARD
## mers to receive a £400 reward . Just call 09066380611 Camera - You are awarded
## are awarded a SiPix Digital Camera ! call 09061221066 fromm landline . Deliver
## landline . Delivery within 28 days . Call 09095350301 and send our girls into
## stacy . Just 60p/min . To stop texts call 08712460324 ( nat rate ) u r subscri
## mx3age16subscription Urgent ! Please call 09061213237 from landline . £5000 ca
## rded with a £2000 prize GUARANTEED . Call 09061790126 from land line . Claim 3
## Ltd Suite 373 London W1J 6HL Please call back if busy Urgent ! Please call 09
## se call back if busy Urgent ! Please call 09061213237 from a landline . £5000
## ervice ! To find out who it could be call from your mobile or landline 0906401
Words tend to appear in typical, recurrent combinations:
➢ day and night
➢ ring and bell
➢ milk and cow
➢ kick and bucket
➢ brush and teeth
➣ such pairs are called collocations (Firth, 1957)
➣ the meaning of a word is in part determined by its characteristic collocations
“You shall know a word by the company it keeps!” (Firth, 1957)
Empirically, collocations are words that have a tendency to occur near each other.
Words do not randomly appear together. Some of these co-occurrences are extremely consistent and carry meaning with them. Collocation is important to look at when we study language, and it is really the mass observation of co-occurrence in corpus data that allows us to begin to measure the extent to which words come together to form meaning.
def generate_collocations(tokens):
    '''
    Given a list of tokens, return collocations.
    '''
    ignored_words = nltk.corpus.stopwords.words('english')
    bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
    bigramFinder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
    bigram_freq = bigramFinder.ngram_fd.items()
    bigramFreqTable = pd.DataFrame(list(bigram_freq), columns=['bigram', 'freq']).sort_values(by='freq', ascending=False)
    return bigramFreqTable
print (generate_collocations(dict1['spam'].split()))
## bigram freq
## 321 (Please, call) 26
## 352 (GUARANTEED., Call) 19
## 145 (£1000, cash) 19
## 351 (prize, GUARANTEED.) 19
## 354 (land, line.) 16
## 140 (Valid, 12hrs) 16
## 63 (Account, Statement) 16
## 131 (draw, shows) 15
## 698 (please, call) 14
## 564 (2nd, attempt) 14
## 613 (Call, MobileUpd8) 14
## 675 (customer, service) 14
## 62 (2003, Account) 13
## 71 (Identifier, Code:) 13
## 355 (line., Claim) 13
## 509 (every, week) 13
## 672 (guaranteed, £1000) 12
## 64 (shows, 800) 12
## 68 (points., Call) 12
## 65 (800, un-redeemed) 12
## 328 (await, collection.) 11
## 1280 (dating, service) 11
## 749 (live, operator) 10
## 220 (Free, entry) 10
## 828 (call, 08000930705) 10
## 758 (send, STOP) 10
## 937 (SAE, T&Cs) 10
## 676 (service, representative) 10
## 316 (txt, MUSIC) 9
## 659 (500, pounds) 9
## ... ... ...
## 2121 (Stop2, cancel) 1
## 2122 (cancel, Xmas) 1
## 2124 (Years, Eve) 1
## 2125 (Eve, tickets) 1
## 2102 (Cost, £1.50) 1
## 2101 (3UZ, Cost) 1
## 2100 (PoBox84, M26) 1
## 2099 (1st4Terms, PoBox84) 1
## 2074 ((to, bid) 1
## 2075 (bid, £10)) 1
## 2076 (83383., Good) 1
## 2077 (Good, luck.) 1
## 2078 (luck., Text) 1
## 2079 (Text, BANNEDUK) 1
## 2080 (see!, cost) 1
## 2084 (g696ga, 18+) 1
## 2085 (18+, XXX) 1
## 2086 (XXX, URGENT!) 1
## 2088 (Call, 09050000460) 1
## 2089 (Claim, J89.) 1
## 2090 (next, month) 1
## 2091 (month, get) 1
## 2092 (get, upto) 1
## 2093 (upto, 50%) 1
## 2094 (standard, network) 1
## 2095 (network, charge) 1
## 2096 (activate, Call) 1
## 2097 (Call, 9061100010) 1
## 2098 (Wire3.net, 1st4Terms) 1
## 4531 (rcv, Free) 1
##
## [4532 rows x 2 columns]
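Ranking bigrams by raw frequency, as above, tends to favour pairs of common words. A standard refinement is pointwise mutual information (PMI), which compares the observed co-occurrence rate with what chance would predict. A minimal sketch (the counts passed in below are illustrative, loosely inspired by the table above, not computed from the corpus):

```python
import math

# PMI for a bigram: log2 of observed co-occurrence probability over
# the probability expected if the two words were independent.
def pmi(bigram_count, count_w1, count_w2, total_tokens):
    p_xy = bigram_count / total_tokens
    p_x = count_w1 / total_tokens
    p_y = count_w2 / total_tokens
    return math.log2(p_xy / (p_x * p_y))

# illustrative counts for a pair like ("please", "call")
print(round(pmi(26, 40, 350, 100_000), 2))
```

A PMI of 0 means the pair co-occurs exactly as often as chance predicts; large positive values indicate a genuine collocation. NLTK’s BigramCollocationFinder can also score bigrams by PMI via nltk.collocations.BigramAssocMeasures.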
Keywords are those whose frequency is unusually high in comparison with some norm.
In order to identify significant differences between two corpora, or two parts of a corpus, we often use a statistical measure called keyness.
Imagine two highly simplified corpora. Each contains only 3 different words cat, dog, and cow and has a total of 100 words. The frequency counts are as follows:
Corpus A: cat 52; dog 17; cow 31
Corpus B: cat 9; dog 40; cow 31
Cat and dog would be key, as they are distributed differently across the corpora, but cow would not as its distribution is the same. Put another way, cat and dog are distinguishing features of the corpora; cow is not.
Normally, we use a concordancing program like AntConc or WordSmith to calculate keyness for us. While we can let these programs do the mathematical heavy lifting, it’s important that we have a basic understanding of what these calculations are and what exactly they tell us.
There are 2 common methods for calculating distributional differences: a chi-squared test ( or χ² test) and log-likelihood.
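The log-likelihood calculation can be sketched directly for the toy cat/dog/cow corpora above (following the usual expected-frequency formulation; a value above 3.84 is significant at p < 0.05 for one degree of freedom):

```python
import math

# Log-likelihood keyness: a and b are the word's counts in the two
# corpora; c and d are the corpus sizes. Expected counts assume the
# word is spread across both corpora in proportion to their sizes.
def log_likelihood(a, b, c, d):
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

print(round(log_likelihood(52, 9, 100, 100), 2))   # cat: clearly key
print(round(log_likelihood(31, 31, 100, 100), 2))  # cow: 0.0, not key
```

As the prose above predicts, cat scores highly while cow scores zero, because cow is distributed identically across the two corpora.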
This clip here shows how to perform keyness in AntConc (https://github.com/salihadfid1/NLP_INTRO/blob/master/Keyness%20in%20AntConc.zip “Github”)
Exercises:
Add preprocessing to the spam/ham texts and then redo the wordclouds. Do you notice any changes? Hint: use the clean data that you obtained from the previous section.
Pick any 3 tokens from the spam/ham dataset and calculate their normalised frequencies.
Intended Learning Outcomes: By the end of Chapter 4, you should
Describe Named Entity Recognition and N-grams and how to extract them from a corpus.
Be able to transform text data into numeric data using the Bag-of-Words approach and One-Hot Encoding methods.
Apply the k-means technique from the sklearn library to a corpus.
Describe the TF-IDF scoring method and how to apply it using the sklearn library.
Describe word embeddings and how to extract them from a corpus.
Display text data using wordclouds.
Feature representation is about applying feature engineering techniques to convert text data into numeric data.
Machine learning techniques require numeric input, so text must be transformed before it can be processed.
NER tools assign entities to different classes. Common category labels include PERSON, ORGANIZATION, and GPE (geopolitical entity).
example_document = 'I am flying to JFK in New York in December to visit the Statue of Liberty and Fifth Avenue'
document_tokens = nltk.word_tokenize(example_document)
document_tokens_with_part_of_speech_tag = nltk.pos_tag(document_tokens)
print(document_tokens_with_part_of_speech_tag)
## [('I', 'PRP'), ('am', 'VBP'), ('flying', 'VBG'), ('to', 'TO'), ('JFK', 'NNP'), ('in', 'IN'), ('New', 'NNP'), ('York', 'NNP'), ('in', 'IN'), ('December', 'NNP'), ('to', 'TO'), ('visit', 'VB'), ('the', 'DT'), ('Statue', 'NNP'), ('of', 'IN'), ('Liberty', 'NNP'), ('and', 'CC'), ('Fifth', 'NNP'), ('Avenue', 'NNP')]
entity_recognition = nltk.ne_chunk(document_tokens_with_part_of_speech_tag)
print(entity_recognition)
## (S
## I/PRP
## am/VBP
## flying/VBG
## to/TO
## (ORGANIZATION JFK/NNP)
## in/IN
## (GPE New/NNP York/NNP)
## in/IN
## December/NNP
## to/TO
## visit/VB
## the/DT
## Statue/NNP
## of/IN
## (ORGANIZATION Liberty/NNP)
## and/CC
## (PERSON Fifth/NNP Avenue/NNP))
Discussion: What do you think of the outcome?
Exercise: Use the same sentence with lowercase letters then test if nltk can still recognise everything.
An N-gram is a sequence of characters or words. A character unigram consists of 1 character, and a character N-gram of N characters. The same applies to words: a word N-gram consists of a sequence of N words. In the example below we use word N-grams.
from nltk.util import ngrams
number_of_ngrams = 2 #change n to get unigrams, trigrams, etc.
example_sentence = 'I am flying to JFK in New York in December to visit the Statue of Liberty and Fifth Avenue'
n_grams_of_example_sentence = ngrams(nltk.word_tokenize(example_sentence), number_of_ngrams) #splitting the sentence into n-grams; here n=2, i.e. bigrams
for grams in n_grams_of_example_sentence:
print(grams)
## ('I', 'am')
## ('am', 'flying')
## ('flying', 'to')
## ('to', 'JFK')
## ('JFK', 'in')
## ('in', 'New')
## ('New', 'York')
## ('York', 'in')
## ('in', 'December')
## ('December', 'to')
## ('to', 'visit')
## ('visit', 'the')
## ('the', 'Statue')
## ('Statue', 'of')
## ('of', 'Liberty')
## ('Liberty', 'and')
## ('and', 'Fifth')
## ('Fifth', 'Avenue')
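Under the hood, a word N-gram is just a sliding window over the token list. A minimal pure-Python equivalent of what ngrams() produces (using a simple whitespace split here rather than nltk’s tokenizer):

```python
# A word n-gram is a sliding window of n consecutive tokens.
def word_ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("I am flying to JFK".split(), 2))
# [('I', 'am'), ('am', 'flying'), ('flying', 'to'), ('to', 'JFK')]
```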
Bag-of-Words (BOW) does not consider grammar or word order; given a document, it simply measures the frequency of each word.
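In its simplest form, a bag of words is just a table of word counts, which Python’s Counter gives us directly (toy sentence below):

```python
from collections import Counter

# A bag of words discards order and keeps only per-word counts.
sentence = "the cat sat on the mat the end"
bag = Counter(sentence.split())
print(bag["the"])  # 3
print(bag["cat"])  # 1
```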
One-Hot Encoding maps categorical values to binary indicator columns, one per category.
how_I_feel = ['happy', 'unhappy', 'unhappy', 'neutral', 'happy', 'happy']
encoded_feelings = pd.get_dummies(how_I_feel)
print(encoded_feelings)
## happy neutral unhappy
## 0 1 0 0
## 1 0 0 1
## 2 0 0 1
## 3 0 1 0
## 4 1 0 0
## 5 1 0 0
Exercise: What if we had 2 different categorical variables? Perform One-Hot Encoding with two categories using the simple approach shown above.
bag_of_word_example_sentence = ["to handle a language skillfully is to practice a kind of evocative sorcery", "Words are a pretext it is the inner bond that draws one person to another not words", "touch comes before sight, before speech it is the first language and the last and it always tells the truth","to learn a language is to have one more window from which to view the world"]
vectorizer = CountVectorizer() #creating the transformer
vectorized_example = vectorizer.fit_transform(bag_of_word_example_sentence) #tokenizing and building the vocabulary from the example sentences
tdm = pd.DataFrame(vectorized_example.toarray(), columns = vectorizer.get_feature_names()) #newer scikit-learn: use get_feature_names_out()
print(tdm)
## always and another are before ... view which window words world
## 0 0 0 0 0 0 ... 0 0 0 0 0
## 1 0 0 1 1 0 ... 0 0 0 2 0
## 2 1 2 0 0 2 ... 0 0 0 0 0
## 3 0 0 0 0 0 ... 1 1 1 0 1
##
## [4 rows x 42 columns]
Exercise:
Construct a few sentences and compute bigrams, unigrams from them.
Deploy CountVectorizer over the sentences and view the resultant matrix in a pandas dataframe.
Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is often used in information retrieval. A word’s weight increases proportionally with the number of times it appears in a document, but is offset by the number of documents that contain the word. Words that are common in every document, such as this, what, and if, therefore rank low even though they may appear many times, since they say little about any document in particular. However, if the word bug appears many times in one document while rarely appearing in others, it is probably very relevant to that document.
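The textbook scoring can be sketched by hand on a toy corpus. Note that scikit-learn’s TfidfVectorizer uses a smoothed idf and L2 normalisation, so its numbers will differ, but the intuition is the same:

```python
import math

# Textbook tf-idf: tf(t, d) * log(N / df(t)), where N is the number of
# documents and df(t) the number of documents containing the term.
documents = [
    "the bug crashed the program",
    "the program ran fine",
    "the report was fine",
]

def tfidf(term, doc_index):
    tokenised = [doc.split() for doc in documents]
    tf = tokenised[doc_index].count(term)
    df = sum(term in doc for doc in tokenised)
    return tf * math.log(len(documents) / df)

print(round(tfidf("bug", 0), 3))  # high: appears only in document 0
print(tfidf("the", 0))            # 0.0: "the" is in every document
```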
Steps to plot a wordcloud with TF-IDF:
Get a sample of data from the spam/ham dataset.
Clean it, using some of the preprocessing functions.
Apply TfidfVectorizer() from the scikit-learn library.
View the term-document matrix.
Construct a word cloud from the tf-idf scores.
print ("Sample data", samp_data)
## Sample data Email Description
## 0 spam Text82228>> Get more ringtones, logos and game...
## 1 spam Dorothy@kiefer.com (Bank of Granite issues Str...
## 2 spam Great NEW Offer - DOUBLE Mins & DOUBLE Txt on ...
## 3 spam lyricalladie(21/F) is inviting you to be her f...
## 4 spam Tone Club: Your subs has now expired 2 re-sub ...
## 5 ham Have you heard from this week?
## 6 ham Well I'm going to be an aunty!
## 7 ham Lol yes. But it will add some spice to your day.
## 8 ham Tell me again what your address is
## 9 ham I'm sorry. I've joined the league of people th...
alltext = " "
for index, row in samp_data.iterrows():
    # note: assigning into an iterrows() row may not write back to the
    # dataframe in newer pandas versions; .apply is a safer alternative
    row['Description'] = ' '.join(preprocess(row['Description']))
    alltext = row['Description'] + alltext
print ("Sample data", samp_data.head())
## Sample data Email Description
## 0 spam get more ringtones logos and games from questi...
## 1 spam dorothy bank of granite issues explosive pick ...
## 2 spam great new offer double mins double txt on best...
## 3 spam lyricalladie is inviting you to be her friend ...
## 4 spam tone club your subs has now expired reply mono...
vectorizer = TfidfVectorizer()
samp_data_vectorised = vectorizer.fit_transform(samp_data['Description'])
#view the term-document matrix (newer scikit-learn: use get_feature_names_out())
tdm = pd.DataFrame(samp_data_vectorised.toarray(), columns = vectorizer.get_feature_names())
print (tdm)
## add address again all ... will yes you your
## 0 0.00000 0.000000 0.000000 0.000000 ... 0.00000 0.00000 0.000000 0.000000
## 1 0.00000 0.000000 0.000000 0.000000 ... 0.00000 0.00000 0.000000 0.000000
## 2 0.00000 0.000000 0.000000 0.000000 ... 0.00000 0.00000 0.000000 0.000000
## 3 0.00000 0.000000 0.000000 0.000000 ... 0.00000 0.00000 0.177954 0.000000
## 4 0.00000 0.000000 0.000000 0.000000 ... 0.00000 0.00000 0.000000 0.161024
## 5 0.00000 0.000000 0.000000 0.000000 ... 0.00000 0.00000 0.352809 0.000000
## 6 0.00000 0.000000 0.000000 0.000000 ... 0.00000 0.00000 0.000000 0.000000
## 7 0.31638 0.000000 0.000000 0.000000 ... 0.31638 0.31638 0.000000 0.235301
## 8 0.00000 0.414196 0.414196 0.000000 ... 0.00000 0.00000 0.000000 0.308050
## 9 0.00000 0.000000 0.000000 0.164544 ... 0.00000 0.00000 0.244752 0.000000
##
## [10 rows x 110 columns]
words_to_cloud (alltext)
K-Means is a very simple algorithm which clusters the data into K clusters. It is widely used in many applications (image segmentation, news article clustering, clustering languages).
Further details can be found here: http://benalexkeen.com/k-means-clustering-in-python/ and elsewhere online.
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2)
clusters = km.fit(tdm)
#Show counts per cluster number
print("Counts per Cluster", np.unique(clusters.labels_, return_counts=True))
## Counts per Cluster (array([0, 1]), array([7, 3], dtype=int64))
#Check same number of documents returned
print("Number of documents clustered", np.unique(clusters.labels_, return_counts=True)[1].sum())
## Number of documents clustered 10
#Show number of iterations of K-means
print("number of iterations: {0}".format(clusters.n_iter_))
## number of iterations: 2
#add the cluster number to each input record
samp_data['clusterresult']=clusters.labels_
print (samp_data)
## Email Description clusterresult
## 0 spam get more ringtones logos and games from questi... 0
## 1 spam dorothy bank of granite issues explosive pick ... 0
## 2 spam great new offer double mins double txt on best... 0
## 3 spam lyricalladie is inviting you to be her friend ... 1
## 4 spam tone club your subs has now expired reply mono... 0
## 5 ham have you heard from this week 0
## 6 ham well i going to be an aunty 1
## 7 ham lol yes but it will add some spice to your day 1
## 8 ham tell me again what your address is 0
## 9 ham i sorry i joined the league of people that don... 0
How do you make a computer understand that “Apple” in “Apple is a tasty fruit” is a fruit that can be eaten and not a company?
The answer to this question lies in creating a representation for words that captures their meanings, their semantic relationships and the different contexts they are used in.
All of this is implemented using word embeddings: numerical representations of text that computers can handle.
They are a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.
Problem with one-hot representations: words are treated as atomic symbols, so all word vectors are orthogonal and equidistant.
Goal: word vectors with a natural notion of similarity
For example: “hotel”, “motel”
Make use of distributional similarity (The meaning of a word is given by the context where it appears)
You can get a lot of value by representing a word by means of its neighbors
“You shall know a word by the company it keeps” J. R. Firth 1957: 11
One of the most successful ideas of modern statistical NLP.
You can vary whether you use local or large context to get a more syntactic or semantic clustering
Central idea: represent words by their context
Shift in Meaning
Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding.
How can we build simple, scalable, fast to train models which can run over billions of words that will produce exceedingly good word representations?
Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network. It was developed by Tomas Mikolov at Google in 2013.
Word2vec is a technique/model that produces word embeddings for better word representation. It captures a large number of precise syntactic and semantic word relationships using a shallow, two-layer neural network.
Words are represented in the form of vectors, placed in such a way that words with similar meanings appear together and dissimilar words are located far apart. This is also termed a semantic relationship.
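Closeness between embedding vectors is usually measured with cosine similarity. A minimal sketch on toy 3-dimensional vectors (invented for illustration, not real word2vec output):

```python
import math

# Cosine similarity: dot product of two vectors divided by the product
# of their lengths. Close to 1 means the vectors point the same way.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

hotel = [0.9, 0.8, 0.1]  # toy vectors, not learned embeddings
motel = [0.8, 0.9, 0.2]
fruit = [0.1, 0.1, 0.9]
print(round(cosine_similarity(hotel, motel), 3))  # close to 1
print(round(cosine_similarity(hotel, fruit), 3))  # much lower
```

Gensim exposes the same idea through model.wv.similarity() on a trained model.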
Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus.
Two different learning models were introduced that can be used as part of the word2vec approach to learn the word embedding:
* Continuous Bag-of-Words (CBOW) model, which learns the embedding by predicting the current word based on its context.
* Continuous Skip-Gram model, which learns by predicting the surrounding words given the current word.
See more details here https://towardsdatascience.com/word-to-vectors-natural-language-processing-b253dd0b0817
matrix
Gensim is an open-source library for unsupervised topic modelling and natural language processing. Let’s have a look and get some embeddings for our spam/ham corpus.
###plot embeddings
from gensim.models import Word2Vec
complete_list = populatedictcorpus(raw_data)
#spam
model = Word2Vec(complete_list[1], min_count=20, size=50, workers=4)
# summarize the loaded model
print(model)
# summarize vocabulary
## Word2Vec(vocab=127, size=50, alpha=0.025)
words = list(model.wv.vocab)
print(words)
# access vector for one word
## ['Free', 'entry', 'in', '2', 'a', 'to', 'win', 'Text', 'receive', 'txt', 'been', 'now', 'and', 'you', 'for', 'it', 'network', 'customer', 'have', 'selected', 'prize', 'To', 'claim', 'call', 'Claim', 'Valid', 'your', 'mobile', 'or', 'U', 'the', 'latest', 'with', 'Call', 'The', 'Mobile', 'FREE', 'on', 'send', '16+', 'Reply', '4', 'URGENT!', 'You', 'won', '1', 'week', 'our', 'Txt', 'message', '-', 'ur', 'will', 'be', 'Please', 'by', 'reply', 'not', 'We', 'free', 'is', 'now!', 'all', '', 'I', 'that', 'of', 'are', 'awarded', 'UR', 'new', 'service', 'as', 'guaranteed', '£1000', 'cash', 'Your', 'text', 'Get', 'PO', 'Box', '16', 'contact', 'draw', 'shows', '150ppm', '4*', '£2000', 'This', 'from', 'u', 'know', 'get', 'any', 'For', '&', 'per', 'STOP', 'Send', 'only', 'out', '500', 'can', 'just', '18', 'who', 'so', 'NOW', 'me', 'at', 'stop', 'has', 'Just', 'this', 'weekly', 'number', 'Nokia', 'phone', '1st', 'Holiday', '2nd', 'attempt', 'an', 'every', 'CALL', '£100', '8007']
print(model['win'])
# save model
## [-0.03545928 0.25954574 0.06081898 0.15471287 0.08740773 -0.08031593
## 0.01088187 0.01909436 -0.28187066 -0.06735924 0.16190948 0.02756188
## 0.11397921 0.21526596 0.10324737 0.06251433 0.23609957 -0.09641105
## 0.100312 -0.10333646 -0.10575463 -0.04887298 0.06867211 -0.05616473
## 0.17632319 -0.15030345 0.08800747 0.29607433 0.06435648 0.23282269
## 0.00061679 -0.0465043 0.14393388 -0.16550934 0.0212473 0.20148276
## -0.07860402 0.02404218 -0.02530096 0.1008357 -0.38878825 0.09220345
## -0.21316203 -0.04620269 -0.10697421 0.0514056 0.04928547 -0.10309346
## -0.2262441 -0.2819984 ]
##
## C:\Users\s-minhas\AppData\Local\CONTIN~1\ANACON~1\python.exe:1: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
#ham
## Word2Vec(vocab=127, size=50, alpha=0.025)
model2 = Word2Vec(complete_list[2], min_count=20,size=50,workers=4)
# summarize the loaded model
print(model2)
# summarize vocabulary
## Word2Vec(vocab=470, size=50, alpha=0.025)
words2 = list(model2.wv.vocab)
print(words2)
# access vector for one word
## ['until', 'only', 'in', 'n', 'great', 'e', 'there', 'got', 'Ok', 'wif', 'u', 'U', 'dun', 'say', 'so', 'early', 'c', 'already', 'then', 'I', "don't", 'think', 'he', 'goes', 'to', 'around', 'here', 'my', 'is', 'not', 'like', 'with', 'me.', 'They', 'me', 'As', 'your', 'has', 'been', 'as', 'for', 'all', 'friends', "I'm", 'gonna', 'be', 'home', 'soon', 'and', 'i', 'want', 'talk', 'about', 'this', 'stuff', "I've", 'enough', 'today.', 'the', 'right', 'you', 'wont', 'take', 'help', 'will', 'You', 'have', 'a', 'at', 'A', 'Oh', 'watching', 'remember', 'how', '2', 'his', 'Yes', 'He', 'v', 'make', 'if', 'way', 'its', 'b', 'Is', 'that', 'going', 'try', 'So', 'ü', 'pay', 'first', 'Then', 'when', 'da', 'finish', 'lunch', 'go', 'down', 'lor.', '3', 'ur', 'no', 'can', 'meet', 'up', 'Just', 'eat', 'really', 'This', 'getting', 'Lol', 'always', 'Did', 'bus', '?', 'Are', 'an', 'left', 'over', 'dinner', 'Do', 'feel', 'Love', 'back', '&', 'car', "I'll", 'let', 'know', 'room', 'What', 'it', "that's", 'still', 'were', 'sure', 'being', 'or', 'why', 'x', 'us', 'Yeah', 'was', 'had', 'out', 'she', 'that.', 'But', 'we', 'Not', 'doing', 'too', '', 'K', 'tell', 'anything', 'you.', 'of', 'just', 'look', 'msg', 'on', 'may', 'but', 'her', 'done', 'see', 'lor...', 'did', 'do', "i'm", 'trying', 'Pls', 'wanted', ',', 'need', 'you,', '...', 'most', 'love', 'sweet', 'YOU', 'hope', 'well', 'am', '<#>', 'No', 'get', "can't", 'could', 'ask', 'bit', "didn't", 'even', 'are', 'time', 'saw', 'half', 'tomorrow', 'morning', "he's", 'our', 'place', 'tonight', 'never', 'by', 'thought', 'it,', 'since', 'best', 'happy', 'sorry', 'more', 'what', 'now', 'Sorry,', 'call', 'later', 'Tell', 'where', 'Your', 'pick', 'home.', 'good', 'Its', 'Sorry', 'ok', 'come', 'now?', 'check', 'said', 'give', 'class', 'IM', 'AT', 'waiting', 'once', 'very', 'after', 'same', 'How', 'much', 'there.', 'hi', 'Yup', 'next', 'If', 'one', 'send', 'came', 'babe', 'another', 'late', 'means', 'any', 'y', 'buy', 'later.', 'work', 'abt', 'When', 
'-', 'Please', 'text', 'name', 'long', 'them', 'And', 'guess', 'something', 'says', 'life', 'lot', 'dear', 'Thanks', 'making', 'some', 'would', 'My', 'better', 'again', 'Dont', 'cos', 'new', 'Cos', 'special', 'Happy', 'She', '4', 'We', 'went', 'school', 'pls', 'Will', 'Ü', 'wat', 'Good', 'do.', 'sent', 'money', 'dont', 'R', 'ME', 'haf', "It's", 'him', 'Got', 'forgot', "you're", 'little', 'things', 'those', 'd', 'Gud', 'Can', 'ya', 'who', 'from', 'job', 'The', 'thk', 'Ok...', 'Ur', 'out.', 'without', 'tv', 'because', 'miss', 'day', 'Hi', 'which', 'also', 'free', 'liao...', 'coming', 'cant', '.', 'now.', 'Have', 'til', 'end', 'ok.', 'guys', '!', 'Haha', 'jus', 'people', 'keep', 'friend', 'It', 'stop', 'someone', 'able', 'every', 'Hope', 'hav', 'nice', 'Hey', ':)', '<DECIMAL>', 'dat', 'please', 'today', 'before', 'big', 'few', 'use', 'time.', 'called', 'run', 'than', 'Dear', 'Or', 'ill', 'Where', 'reach', 'That', 'told', 'into', 'face', 'watch', "it's", 'u.', 'everything', 'didnt', 'ready', 'night', 'care', 'da.', 'you?', 'other', 'week', "Don't", 'MY', 'Why', 'plan', 'smile', 'might', '1', 'it.', 'All', 'person', 'Ok.', 'last', 'im', 'r', 'hour', 'thats', 'phone', 'message', 'should', 'find', 'made', 'day.', 'they', 'number', 'Am', 'two', 'In', 'ever', '5', 'sleep', 'meeting', 'Well', 'Wat', 'wish', 'quite', 'minutes', 'leave', 'having', 'Was', 'actually', 'put', "i've", 'wanna', 'off', 'thing', 'den', 'mind', 'dis', 'tot', ':-)', 'wait', 'many', 'working', 'shit', 'heart', "That's", 'days', 'bad', 'lor', "i'll", 'IS', 'bring', 'Me', 'saying', 'wants', '*', 'makes', 'hear', 'guy', 'yet', 'wan', 'Now', 'till', 'THE', 'start', 'probably', 'between']
print(model2.wv['guy'])
# save model
## [ 0.04357569 0.08350114 0.34095785 0.17164959 -0.12193359 -0.29131025
## 0.12047595 0.2601956 -0.01061336 0.16121843 -0.09908661 0.02721363
## 0.2669399 0.19444926 -0.09118461 0.04505999 0.07784031 0.14391433
## 0.06386554 0.14761075 -0.12448109 -0.08282301 0.01879397 -0.06944286
## -0.1715292 -0.11754421 0.04511677 0.25595978 0.06588431 0.05481798
## -0.04413901 -0.00869057 0.10135481 -0.12388479 0.01379756 0.0935097
## -0.18663733 0.05465748 -0.06963251 0.22749786 -0.2666086 0.1737071
## -0.33705413 -0.22256711 -0.14515302 0.08217499 0.04112445 0.17980036
## -0.11445044 -0.24587457]
model2.save('model2.bin')
# load model
new_model2 = Word2Vec.load('model2.bin')
print(new_model2)
# dimensionality reduction
## Word2Vec(vocab=470, size=50, alpha=0.025)
X = model.wv[model.wv.vocab]
X2 = model2.wv[model2.wv.vocab]
pca1 = PCA(n_components=2)
result = pca1.fit_transform(X)
pca2 = PCA(n_components=2)
result2 = pca2.fit_transform(X2)
# Create plot
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(result[:, 0], result[:, 1], c="red",s=5,label="spam")
ax.scatter(result2[:, 0], result2[:, 1], c="blue",s=5,label="ham")
plt.xlim(-0.50, 1.25)
plt.ylim(-0.04, 0.04)
plt.gcf().set_size_inches((10, 10))
words = list(model.wv.vocab)
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
words2 = list(model2.wv.vocab)
for i, word2 in enumerate(words2):
    plt.annotate(word2, xy=(result2[i, 0], result2[i, 1]))
plt.title('Spam Ham Embeddings')
plt.legend(loc=2)
plt.show()
# separate plots
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,4), sharey=True, dpi=120)
# Plot
ax1.scatter(result[:, 0], result[:, 1], c="red",label="spam", s= 5)
ax2.scatter(result2[:, 0], result2[:, 1], c="blue",label="ham", s= 5)
# Title, X and Y labels, X and Y Lim
ax1.set_title('Spam Embeddings'); ax2.set_title('Ham Embeddings')
ax1.set_xlabel('X'); ax2.set_xlabel('X') # x label
ax1.set_ylabel('Y'); ax2.set_ylabel('Y') # y label
ax1.set_xlim(-0.50, 1.25) ; ax2.set_xlim(-0.50, 1.25) # x axis limits
ax1.set_ylim(-0.04, 0.04); ax2.set_ylim(-0.04, 0.04) # y axis limits
words = list(model.wv.vocab)
for i, word in enumerate(words):
    ax1.annotate(word, xy=(result[i, 0], result[i, 1]), fontsize=5)
words2 = list(model2.wv.vocab)
for i, word2 in enumerate(words2):
    ax2.annotate(word2, xy=(result2[i, 0], result2[i, 1]), fontsize=5)
ax1.legend(loc=2)
ax2.legend(loc=5)
# ax2.yaxis.set_ticks_position('none')
plt.tight_layout()
plt.show()
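One caveat when reading these plots: projecting 50-dimensional embeddings down to two principal components discards most of the variance, and PCA's explained_variance_ratio_ tells you how much the plot actually retains. A minimal sketch, assuming random data as a stand-in for the embedding matrix X (the shape matches the spam model above, but the values are illustrative only):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(470, 50))  # same shape as the spam embedding matrix

pca = PCA(n_components=2)
pca.fit(X_demo)
# Fraction of total variance captured by each of the two plotted components
print(pca.explained_variance_ratio_)
```

If the two ratios sum to a small fraction, treat distances in the scatter plot with caution.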
Exercises:
Note: This chapter is mostly derived from Dan Jurafsky’s slides available here https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
Intended Learning Outcomes: By the end of Chapter 5, you should
Describe Language Modelling
Appreciate its usefulness to commercial language-based applications
Be able to take a sentence from a corpus and compute its probability
“Language modeling is the task of assigning a probability to sentences in a language. […]
Besides assigning a probability to each sequence of words, the language models also assigns
a probability for the likelihood of a given word (or a sequence of words) to follow
a sequence of words.”
— Page 105, Neural Network Methods in Natural Language Processing, 2017.
Language modeling is central to many important natural language processing tasks. For example:
• Machine Translation: P(high winds tonite) > P(large winds tonite)
• Spell Correction: given "The office is about fifteen minuets from my house", P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition: P(I saw a van) > P(eyes awe of an)
• Summarization
• Question Answering
Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1, w2, w3, w4, w5, …, wn)
Related task: Probability of an upcoming word:
P(w5 | w1, w2, w3, w4)
A way to tackle this is shown below:
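As a minimal sketch of this idea (the toy corpus and function names here are my own, not from the course datasets), a bigram model estimates each P(wi | wi-1) from counts and multiplies the conditional probabilities along the sentence using the chain rule:

```python
from collections import Counter

# Toy corpus: each sentence is padded with <s> (start) and </s> (end) markers
corpus = [
    ["<s>", "i", "saw", "a", "van", "</s>"],
    ["<s>", "i", "saw", "a", "cat", "</s>"],
    ["<s>", "a", "cat", "saw", "me", "</s>"],
]

# Count single words and adjacent word pairs across the corpus
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def bigram_prob(sentence):
    """P(sentence) = product over i of P(w_i | w_{i-1}), estimated by MLE counts."""
    p = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, word)] / unigrams[prev]
    return p

print(bigram_prob(["<s>", "i", "saw", "a", "van", "</s>"]))  # 4/27, about 0.148
```

Real language models add smoothing so that unseen bigrams do not force the whole probability to zero.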
Intended Learning Outcomes: By the end of Chapter 5, you should
feel confident describing datasets,
be able to discuss the appropriate steps for text pre-processing and some exploratory analysis (wordclouds)
You should be able to perform these steps in the appropriate order and communicate the results.
Import the Patent Dataset to Python
Understand/Describe the dataset
Think what we could do with this dataset
Perform the appropriate steps to extract meaningful outcomes from the dataset (pre-processing using clean_up_text()).
Plot a wordcloud with TF-IDF
What did we find?
What does it mean? Communicate your results
Hint: To be able to plot the TF-IDF wordcloud you will need a list of lists of strings.
patent_data_abstract = patent_data["abstract"] #taking the abstract only
patent_data_abstract is a pandas Series. To plot a TF-IDF wordcloud we need a list of lists of strings, where each list of strings is treated as a separate document. Also, for clean_up_text() we need to loop over and clean each abstract separately and join the results, so that the end result is a list of lists of strings. See the code below.
# changing from a pandas Series to a list of lists of strings
# range(0,100) because this is how many abstracts we have
list_of_abstracts = []
for i in range(0, 100):
    list_of_abstracts.append(patent_data_abstract.iloc[[i]].tolist())
clean_patent_data=[]
# list_of_abstracts is a list of lists of strings. For each inner list, join its strings into one string, apply the clean_up_text function, then split the cleaned string into a list of words. Note: clean_up_text returns a string, which is why .split() is called on the result
for each_list in list_of_abstracts:
    patent_data_string = " ".join(each_list)
    temporary_variable = clean_up_text(patent_data_string)
    clean_patent_data.append(temporary_variable.split())
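For reference, the same Series-to-list-of-lists conversion can be written more compactly with a list comprehension. In this self-contained sketch clean_up_text is stubbed out with a simple lowercasing placeholder (an assumption for illustration; the course's real function does much more):

```python
import pandas as pd

# Stand-in for the course's clean_up_text(); returns a cleaned string
def clean_up_text(text):
    return text.lower()

# Two dummy abstracts in place of the real patent data
patent_data_abstract = pd.Series([
    "A Method For Encoding Video Frames.",
    "An Apparatus For Wireless Charging.",
])

# One cleaned, tokenised list of strings per abstract
clean_patent_data = [clean_up_text(abstract).split()
                     for abstract in patent_data_abstract]
print(clean_patent_data[0])
```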
Import the Hep Dataset (High Energy Physics) to Python
Understand/Describe the dataset
Think what we could do with this dataset
Perform the appropriate steps to extract meaningful outcomes from the dataset (pre-processing). Hint: Use the function clean_up_text().
Use wordcloud with simple frequency
What did we find?
Generate embeddings for the abstracts
What does it mean? Communicate your results
Intended Learning Outcomes: Now, you should
feel confident to describe datasets,
discuss what the appropriate steps are to do text-preprocessing and exploratory analysis (wordclouds)
You should also be able to perform these steps in the appropriate order and communicate the results.
Natural Language Processing with Python, Analyzing Text with the Natural Language Toolkit, Steven Bird, Ewan Klein and Edward Loper, O’Reilly
Introduction to Natural Language Processing, Concepts and Fundamentals for Beginners, Michael Walker, AI Sciences
Hands-On Natural Language Processing with Python, A practical guide to applying deep learning architectures to your NLP applications, Rajesh Arumugan and Rajalingappaa Shanmugamani, Packt
Python Natural Language Processing, Advance Machine learning and deep learning techniques for natural language processing, Jalaj Thanaki, Packt
Speech and Language Processing (3rd ed. draft) Dan Jurafsky and James H. Martin Draft chapters in progress, October 16, 2019. PDF available at https://web.stanford.edu/~jurafsky/slp3/
Clustering and Dimensionality Reduction Algorithms
Topic Modelling
Text Classification
Sentiment Analysis
“Speech and Language Processing (3rd ed. draft)”, Dan Jurafsky and James H. Martin Draft chapters in progress, October 16, 2019
“Corpus Linguistics, Method, Theory and Practice”, Tony McEnery and Andrew Hardie (2012)
Figure 1 extracted from https://givemefluency.com/2016/05/01/great-way-to-maintain-your-languages/
Figure 2 extracted from https://www.youtube.com/watch?v=bzz1pFWAtMo
Figure 3 extracted from https://www.youtube.com/watch?v=GLBsvdaR_ow
Figure 4 extracted from https://www.youtube.com/watch?v=DF679Ks8ZR4
Figure 5 extracted from https://www.cs.bham.ac.uk/~pjh/sem1a5/pt2/pt2_intro_morphology.html
Figure 6 extracted from https://medium.com/@paulomalvar/pragmatics-the-last-frontier-9d64351eea6f
Figure 7 extracted from https://www.youtube.com/watch?v=zQ6gzQ5YZ8o&list=PLoROMvodv4rOFZnDyrlW3-nI7tMLtmiJZ&index=2&t=0s
Figure 8 extracted from https://www.youtube.com/watch?v=zQ6gzQ5YZ8o&list=PLoROMvodv4rOFZnDyrlW3-nI7tMLtmiJZ&index=2&t=0s
Figure 9 - McEnery, T. & Wilson, A. (2001). Corpus Linguistics
Figure 10 extracted from https://www.nltk.org/book/ch01.html
Figure 11 extracted from https://www.nltk.org/book/ch01.html
Figure 12 extracted from https://web.stanford.edu/~jurafsky/slp3/
Figure 13 extracted from https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e
Figure 14 extracted from https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e
Figure 15 extracted from https://medium.com/@numb3r303_59126/enriching-word-vectors-with-subword-information-9ebe771a059d
Figure 16 extracted from https://nlp.stanford.edu/projects/histwords/
Figure 17 extracted from https://nlp.stanford.edu/projects/histwords/
A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. The module re provides full support for regular expressions in Python.
re.match(pattern, string, flags = 0): checks for a match only at the beginning of the string and returns the first match found there, or None if the string does not start with the pattern
re.search(pattern, string, flags = 0): scans the entire string and returns the first match found anywhere in it (use re.findall if you want all matches rather than just the first)
re.sub(pattern, replacement, string, count = 0): substitutes matched regular expressions, e.g. to remove excess white space in a text string; count = 0 (the default) replaces all occurrences
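A quick sketch of the differences between these functions (the example strings are illustrative only):

```python
import re

text = "the cat sat on the mat"

# re.match only succeeds if the pattern matches at the START of the string
print(re.match("cat", text))           # None: "cat" is not at position 0
print(re.match("the", text).group())   # 'the'

# re.search scans the whole string and returns the FIRST match anywhere
print(re.search("cat", text).group())  # 'cat'

# re.findall returns every non-overlapping match
print(re.findall("the", text))         # ['the', 'the']

# re.sub replaces matches; count=0 (the default) means replace all
print(re.sub(" +", " ", "too   many   spaces"))  # 'too many spaces'
```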
There are a number of regular expression patterns to help match specific parts of text. For example:
. –> Matches any single character except newline
* –> Matches 0 or more occurrences of the preceding expression
? –> Matches 0 or 1 occurrence of the preceding expression
+ –> Matches 1 or more occurrences of the preceding expression
^ –> Matches the beginning of a line
$ –> Matches the end of a line
\d or [0-9] –> Matches digits
\D –> Matches non-digits
[a-z] –> Matches any lower-case ASCII letter
The pattern #.*$ matches a Python-style comment: a literal ‘#’ followed by 0 or more occurrences of any character up to the end of the line. Regex can be quite powerful - but also a bit tricky and difficult to read at times!
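For instance, that pattern can be combined with re.sub to strip such comments (the line of code below is a toy example, not from the course material):

```python
import re

line = "x = 42  # the answer"
# Remove the '#' and everything after it, then tidy trailing whitespace
stripped = re.sub(r"#.*$", "", line).strip()
print(stripped)  # 'x = 42'
```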
Example
import re
example_string = 'Regex   is   the   best!!!'
print(example_string)
## Regex   is   the   best!!!
# substitute runs of white space with a single space
new_string = re.sub(' +', ' ', example_string)
print(new_string)
## Regex is the best!!!
If you are interested more in regular expressions you can experiment yourself using: https://regexr.com
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store.
UH interjection errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3d take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
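As a small illustration of how this table is used in practice, the sketch below maps a few of the tags back to their descriptions. The dictionary covers only a handful of entries, and the tagged sentence mimics the (word, tag) output shape of a tagger such as nltk.pos_tag (nltk itself is not required here):

```python
# Partial lookup table built from the tag list above
TAG_DESCRIPTIONS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular",
    "VBZ": "verb, 3rd person sing. present",
}

def describe(tagged):
    """Map (word, tag) pairs to human-readable tag descriptions."""
    return [(word, TAG_DESCRIPTIONS.get(tag, "unknown tag"))
            for word, tag in tagged]

# Hypothetical tagger output for "the big dog barks"
tagged_sentence = [("the", "DT"), ("big", "JJ"), ("dog", "NN"), ("barks", "VBZ")]
print(describe(tagged_sentence))
```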